Abstract

The genetic code is the connection between 64 codons, which are the building blocks of
genes, and 20 amino acids, which are the building blocks of proteins. In addition
to coding amino acids, a few codons code the stop signal, which is located at the end of
a gene and terminates the process of protein synthesis. This article is a review of simple
modelling of the genetic code and related subjects by means of the concept of p-adic
distance. It also contains some new results. In particular, the article presents an
appropriate structure of the codon space, the degeneracy and a possible evolution of the
genetic code. p-Adic modelling of the genetic code is viewed as the first step in the
further application of p-adic tools in the information sector of life science.

Key Words: genetic code, p-adic distance, DNA and RNA, codons, amino acids, 
proteins, evolution, information 

1 Introduction 

Francis Crick (1916-2004), who together with James Watson discovered the double-helical
structure of DNA, announced in 1953 "We have discovered the secret of life" (Hayes,
1998). However, life still has many secrets, and the genetic code seems to be the most
intriguing one. Although the standard genetic code was finally deciphered experimentally
in 1966, its theoretical understanding has remained unsatisfactory, and new models have
been proposed from time to time. The genetic code is still a subject of investigation from
mathematical, physical, chemical, biological and bioinformation points of view. However,
many of these models are rather complicated and do not give a complete description and
understanding of the various properties of the genetic code.

It is instructive to recall the discovery of quantum mechanics. Before its emergence, many
physical experimental data could not be described well by classical methods. It was
necessary to invent new appropriate physical concepts and to use suitable new mathematical
methods. It seems that a similar situation should arise in the theoretical description of
living processes in biological organisms. To this end, p-adic methods seem to be very
promising tools for the further investigation of life.

In this article we emphasize the role of p-adic distance. Namely, some parts of a
biological system can be considered simultaneously with respect to different metrics: the
usual Euclidean metric, which measures spatial distances, and some other metrics, which
measure nearness related to some bioinformation (or other) properties. Here we consider
the genetic code using an ultrametric space whose elements are codons represented by
some natural numbers, and the distance between them is the p-adic one. An ultrametric
space M is a metric space whose distances satisfy the strong triangle inequality (also
called the ultrametric inequality), i.e.

$d(x, y) \le \max\{d(x, z), d(z, y)\}$

for any $x, y, z \in M$. The ultrametric inequality was formulated by Felix Hausdorff in 1934
and ultrametric spaces were introduced by Marc Krasner in 1944. An ultrametric is also
called a non-Archimedean metric. Ultrametric spaces exhibit some exotic properties. The
first application of ultrametricity was in biological taxonomy. Ultrametricity in physics
(Rammal et al, 1986) was observed in 1984 in the context of the mean field theory of
spin glasses, and it induced considerable research in many scientific fields (e.g.
statistical physics, neural networks, conformational structure of proteins, diffusion
processes, hierarchical systems).

Modelling the genetic code is an opportunity for the application of p-adic distance. In 2006
we introduced (Dragovich B. and Dragovich A., 2006) a p-adic approach to DNA and RNA
sequences, and to the genetic code. The central point of our approach is an appropriate
identification of the four nucleotides with the digits 1, 2, 3, 4 of 5-adic number
expansions and the application of p-adic distances between the obtained numbers. 5-Adic
numbers with three such digits form 64 integers which correspond to the 64 codons. In
(Dragovich B. and Dragovich A., 2007) we analyzed p-adic degeneracy of the genetic code.
One of the main results that we have obtained is an explanation of the structure of the
genetic code degeneracy using p-adic distance between codons. The paper (Dragovich B. and
Dragovich A., 2010) considers a possible evolution of the genetic code and some
generalizations of p-adic modelling of the genetic code. The article (Dragovich, 2009) is
related to the role of number theory in modelling the genetic code. A similar approach to
the genetic code was reconsidered on the dyadic plane (Khrennikov and Kozyrev, 2007).

p-Adic models in mathematical physics have been actively considered since 1987 (see 
(Brekke et al, 1993; Vladimirov et al, 1994) for early reviews and (Dragovich, 2004; 
Dragovich, 2006; Dragovich et al, 2009) for some recent reviews). It is worth noting 
that p-adic models with pseudodifferential operators have been successfully applied to 
interbasin kinetics of proteins (Avetisov et al, 2002). Some p-adic aspects of cognitive, 
psychological and social phenomena have also been considered (Khrennikov, 2004).

To have a self-contained and comprehensible exposition of the genetic code, we shall 
first briefly review some basic notions from molecular biology. 
2 Basic Notions of Genomics and Proteomics



One of the essential characteristics that distinguish a living organism from all other
material systems is related to its genome. The genome of an organism is its whole hereditary
information encoded in deoxyribonucleic acid (DNA), and it contains both coding and
non-coding sequences. In some viruses, which are between living and non-living objects,
the genetic material is encoded in ribonucleic acid (RNA). Investigation of the entire
genome is the subject of genomics. The human genome is composed of more than three
billion DNA base pairs, and 97% of it is non-coding.

DNA is a macromolecule composed of two polynucleotide chains with a double-helical
structure. Nucleotides consist of a base, a sugar and a phosphate group. The sugar
and phosphate groups provide the helical backbone. There are four bases and they are the
building elements of the genetic information. They are named adenine (A), guanine (G),
cytosine (C) and thymine (T). Adenine and guanine are purines, while cytosine and thymine
are pyrimidines. In the sense of information, a nucleotide and its base represent the same
object. Nucleotides are arranged along the chains of the double helix through the base
pairs A-T and C-G, bonded by 2 and 3 hydrogen bonds, respectively. As a consequence of this
pairing there is an equal number of cytosines and guanines, as well as an equal number of
adenines and thymines. DNA is packaged in chromosomes, which are localized in the nucleus
of eukaryotic cells.

The main role of DNA is to store genetic information, and there are two main processes
that exploit this information. The first one is replication, in which DNA duplicates,
giving two new DNA molecules containing the same information as the original one. This is
possible owing to the fact that each of the two chains contains the complementary bases of
the other one. The second process is gene expression, i.e. the passage of DNA gene
information to proteins. It is carried out by messenger ribonucleic acid (mRNA), which is
usually a single polynucleotide chain. The mRNA is synthesized during the first part of
this process, known as transcription, when nucleotides C, A, T, G from DNA are
respectively transcribed into their complements G, U, A, C in mRNA, where T is replaced by U
(U is uracil, which is a pyrimidine). The next step in gene expression is translation,
when the information coded by codons in the mRNA is translated into proteins. Transfer
RNA (tRNA) and ribosomal RNA (rRNA) also participate in this process.
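
As a minimal illustration of the transcription rule stated above, the following Python sketch maps each DNA base to its mRNA complement. The names `DNA_TO_MRNA` and `transcribe` are hypothetical, and the sketch deliberately ignores strand direction and all biochemical detail.

```python
# Complement rule from the text: C, A, T, G in DNA become G, U, A, C in mRNA.
DNA_TO_MRNA = {"C": "G", "A": "U", "T": "A", "G": "C"}


def transcribe(dna: str) -> str:
    """Return the mRNA complement of a DNA string, base by base."""
    return "".join(DNA_TO_MRNA[base] for base in dna.upper())


print(transcribe("CATG"))  # GUAC
```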

Protein synthesis in all eukaryotic cells takes place in the ribosomes of the cytoplasm.
Proteins (Finkelstein and Ptitsyn, 2002) are organic macromolecules composed of amino
acids arranged in a linear chain. The sequence of amino acids in a protein is determined
by the sequence of codons contained in RNA genes. Amino acids are molecules that consist of
amino, carboxyl and R (side chain) groups. Depending on the R group there are 20 standard
amino acids. These amino acids are joined together by peptide bonds. Proteins are
substantial ingredients of all living organisms, participating in various processes in cells
and determining the phenotype of an organism. There are more proteins than genes in
DNA, because of alternative splicing of genes and translational modifications. In the
human body there may be about 2 million different proteins. The study of proteins,
especially their structure and functions, is called proteomics. The complete proteome is
the entire set of proteins in an organism.

Table 1. List of 20 standard amino acids used in proteins by living cells. 3-Letter and
1-letter abbreviations of amino acids, their chemical structure of side chains, polarity and
hydrophobicity are presented.

Amino acid       Abbreviations   Side chain (R)         Polar   Hydrophobic
Alanine          Ala, A          -CH3                   no      yes
Cysteine         Cys, C          -CH2SH                 no      yes
Aspartate        Asp, D          -CH2COOH               yes     no
Glutamate        Glu, E          -(CH2)2COOH            yes     no
Phenylalanine    Phe, F          -CH2C6H5               no      yes
Glycine          Gly, G          -H                     no      yes
Histidine        His, H          -CH2-C3H3N2            yes     no
Isoleucine       Ile, I          -CH(CH3)CH2CH3         no      yes
Lysine           Lys, K          -(CH2)4NH2             yes     no
Leucine          Leu, L          -CH2CH(CH3)2           no      yes
Methionine       Met, M          -(CH2)2SCH3            no      yes
Asparagine       Asn, N          -CH2CONH2              yes     no
Proline          Pro, P          -(CH2)3-               yes     no
Glutamine        Gln, Q          -(CH2)2CONH2           yes     no
Arginine         Arg, R          -(CH2)3NHC(NH)NH2      yes     no
Serine           Ser, S          -CH2OH                 yes     no
Threonine        Thr, T          -CH(OH)CH3             yes     no
Valine           Val, V          -CH(CH3)2              no      yes
Tryptophan       Trp, W          -CH2C8H6N              no      yes
Tyrosine         Tyr, Y          -CH2-C6H4OH            yes     yes

Some properties of amino acids are presented in Table 1. For more detailed and
comprehensive information on genomics and proteomics one can consult the book (Watson et
al., 2004) on molecular biology.

3 General Features of the Genetic Code 

Experimental study of the connection between the ordering of nucleotides in DNA (and RNA)
and the ordering of amino acids in proteins led to the deciphering of the standard genetic
code in the mid-1960s. The genetic code is understood as a dictionary for the translation of
codons from DNA (and RNA) to amino acids during the synthesis of proteins. The information
on amino acids is contained in codons: each codon codes either an amino acid or the
termination signal (see, e.g., Table 2 as a standard table of the vertebrate mitochondrial
genetic code). To a sequence of codons in RNA there corresponds a quite definite sequence of
amino acids in a protein, and this sequence of amino acids determines the primary structure
of the protein. At the time of deciphering, it was mainly believed that the standard code is
unique, a result of chance, and fixed a long time ago. Crick (Crick, 1968) expressed such a
belief in his "frozen accident" hypothesis, which has not been supported by later
observations. Moreover, about 20 different genetic codes have been discovered so far.
However, the differences are not drastic and many common general properties have been found:
four nucleotides, trinucleotide codons, the same mechanism of protein synthesis, ... At
first glance the genetic code looks rather arbitrary, but it is not. Namely, mutations
between synonymous codons give the same amino acid. When a mutation alters an amino acid, it
is like a substitution of the original by a similar one. In this respect the code is almost
optimal.

The relation between codons, on the one hand, and amino acids and the stop signal, on
the other hand, is known as the genetic code.

Codons are ordered triples composed of C, A, U (T) and G nucleotides. Each codon
carries information which controls the use of one of the 20 standard amino acids or the stop
signal in the synthesis of proteins. It is obvious that there are $4 \times 4 \times 4 = 64$ codons.

Although there are about 20 known codes, two of them are the most important: the
standard code and the vertebrate mitochondrial code.

In the sequel we shall mainly have in mind the vertebrate mitochondrial genetic code,
because it is a simple one and the others may be viewed as its slightly modified versions. In
the vertebrate mitochondrial code, 60 codons are distributed among the 20 different amino
acids and 4 codons make the termination signal. According to experimental observations, two
amino acids are coded by six codons, six amino acids by four codons, and twelve amino
acids by two codons. This property that some amino acids are coded by more than one
codon is known as genetic code degeneracy. This degeneracy is a very important property
of the genetic code and gives an efficient way to minimize errors caused by mutations.
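
The codon counts quoted above can be checked with a short Python sketch. The 64-character string `AMINO` is our own one-letter transcription of Table 2 (base order U, C, A, G at each position, `*` for Ter); the names `BASES`, `AMINO` and `CODE` are illustrative assumptions, not notation from the paper.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acids for UUU, UUC, UUA, UUG, UCU, ... in the order of Table 2.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per amino acid, stop codons excluded.
codons_per_aa = Counter(a for a in CODE.values() if a != "*")
# How many amino acids are coded by 2, 4 and 6 codons.
degeneracy = Counter(codons_per_aa.values())

print(len(CODE), sum(codons_per_aa.values()))  # 64 60
print(sorted(degeneracy.items()))              # [(2, 12), (4, 6), (6, 2)]
```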

Table 2. The standard (Watson-Crick) table of the vertebrate mitochondrial genetic
code. Ter denotes the terminal (stop) signal.

UUU Phe    UCU Ser    UAU Tyr    UGU Cys
UUC Phe    UCC Ser    UAC Tyr    UGC Cys
UUA Leu    UCA Ser    UAA Ter    UGA Trp
UUG Leu    UCG Ser    UAG Ter    UGG Trp

CUU Leu    CCU Pro    CAU His    CGU Arg
CUC Leu    CCC Pro    CAC His    CGC Arg
CUA Leu    CCA Pro    CAA Gln    CGA Arg
CUG Leu    CCG Pro    CAG Gln    CGG Arg

AUU Ile    ACU Thr    AAU Asn    AGU Ser
AUC Ile    ACC Thr    AAC Asn    AGC Ser
AUA Met    ACA Thr    AAA Lys    AGA Ter
AUG Met    ACG Thr    AAG Lys    AGG Ter

GUU Val    GCU Ala    GAU Asp    GGU Gly
GUC Val    GCC Ala    GAC Asp    GGC Gly
GUA Val    GCA Ala    GAA Glu    GGA Gly
GUG Val    GCG Ala    GAG Glu    GGG Gly

Since there is in principle a huge number (between $10^{71}$ and $10^{84}$ (Hornos J. and Hornos
Y., 1993)) of all possible assignments between codons and amino acids, and only a very
small number of them is represented in living cells, it has been a permanent theoretical
challenge to find an appropriate model explaining contemporary genetic codes. There are
many papers in this direction scattered in various journals, with theoretical approaches
based more or less on chemical, biological and mathematical aspects of the genetic code.
The first model of the genetic code was proposed in 1954 by the physicist George Gamow
(1904-1968), who called it the diamond code. In his model codons are composed of three
nucleotides and proteins are synthesized directly at DNA: each cavity at DNA attracts one of
20 amino acids. This is an overlapping code and it was ruled out by analysis of correlations
between amino acids in proteins, but the concept of trinucleotide codons was correct. The
next model of the genetic code was proposed in 1957 by Crick, and is known as the comma-free
code. This model was so elegant that it was almost universally accepted. However, an
experiment in 1961 demonstrated that the UUU codon codes the amino acid phenylalanine, while
according to this code it codes nothing. Gamow's and Crick's models are very pretty but
wrong: the living world prefers actual codes, which are more stable with respect to possible
errors (for a popular review of the early models, see (Hayes, 1998)).

Let us mention some models of the genetic code proposed after the deciphering of the
standard code. In 1966 the physicist Yuri Rumer (1901-1985) emphasized the role of the first
two nucleotides in the codons (Rumer, 1966). There are models which are based on chemical
properties of amino acids (see, e.g., (Swanson, 1984)). In some models, connections between
the number of constituents of amino acids and nucleotides and some properties of natural
numbers are investigated (see (Scherbak, 2003; Rakocevic, 2004; Negadi, 2007) and references
therein). A model based on the quantum algebra $U_q(sl(2) \oplus sl(2))$ in the $q \to 0$ limit
was proposed as a symmetry algebra for the genetic code (see (Frappat et al, 2001) and
references therein). In a sense this approach mimics the quark model of protons and
neutrons. Besides some successes of this approach, there is a problem with its rather large
number of parameters. There are also papers, see, e.g., (Hornos J. and Hornos Y., 1993;
Forger and Sachse, 2000; Bashford et al, 1997), starting with a 64-dimensional irreducible
representation of a Lie (super)algebra and trying to connect the multiplicity of codons with
irreducible representations of subalgebras arising in a chain of symmetry breaking. Although
interesting as an attempt to describe the evolution of the genetic code, these Lie algebra
approaches did not progress further. For a very brief review of these and some other
theoretical approaches to the genetic code one can see (Frappat et al, 2001).

Despite remarkable experimental successes and some partial theoretical descriptions,
there is no simple and generally accepted theoretical understanding of the genetic code.
Hence, the foundation of biological coding is still an open problem. In particular, it is not
clear why the genetic code exists in just a few known ways and not in many other possible
ones. What is the origin and evolution of the genetic code? Is there a mathematical principle
behind genetic coding? We keep in mind these and similar questions while trying to find a
simple and general approach, which seems to be p-adic ultrametricity.
4 Ultrametric 5-Adic Space



Before considering p-adic properties of the genetic code in a self-contained way, we shall
recall some mathematical preliminaries.

As a new tool to study Diophantine equations, p-adic numbers were introduced by the
German mathematician Kurt Hensel in 1897. They are involved in many branches of
modern mathematics. An elementary introduction to p-adic numbers can be found in the
book (Gouvea, 1993). However, for our purposes we will use here only a small portion of
p-adics, mainly some finite sets of integers and ultrametric distances between them.

Consider the set of natural numbers

$C_5[64] = \{ n_0 + n_1 5 + n_2 5^2 : n_i = 1, 2, 3, 4 \}$ ,   (1)

where $n_i$ are digits different from zero. This is a finite expansion to the base 5. It is
obvious that 5 is a prime number and that the set $C_5[64]$ contains 64 natural numbers.
In the sequel we shall often denote elements of $C_5[64]$ by their digits to the base 5 in the
following way: $n_0 + n_1 5 + n_2 5^2 \equiv n_0 n_1 n_2$. Note that here the ordering of digits is
the same as in the expansion, i.e. this ordering is opposite to the usual one.
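
The set (1) is small enough to enumerate directly. The following Python sketch builds it as a dictionary keyed by the digit string $n_0 n_1 n_2$; the names `DIGITS` and `C5_64` are illustrative assumptions.

```python
from itertools import product

DIGITS = (1, 2, 3, 4)

# Key: digit string "n0n1n2" (lowest digit first, as in the text).
# Value: the natural number n0 + n1*5 + n2*5**2.
C5_64 = {
    f"{n0}{n1}{n2}": n0 + n1 * 5 + n2 * 5 ** 2
    for n0, n1, n2 in product(DIGITS, repeat=3)
}

print(len(C5_64), min(C5_64.values()), max(C5_64.values()))  # 64 31 124
print(C5_64["123"])  # 86, i.e. 1 + 2*5 + 3*25
```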

It is often important to know the distance between numbers. Distance can be defined by
a norm. On the set $\mathbb{Z}$ of integers there are two kinds of nontrivial norm: the usual
absolute value $|\cdot|_\infty$ and the p-adic absolute value $|\cdot|_p$, where the indices
$\infty$ and p denote the real and p-adic case, respectively (p is any prime number). The usual
absolute value is well known from elementary mathematics and the corresponding ordinary
distance between two numbers x and y is $d_\infty(x, y) = |x - y|_\infty$.

The p-adic absolute value is related to the divisibility of integers by prime numbers.
The difference of two integers is again an integer. The p-adic distance between two integers
can be understood as a measure of the divisibility of their difference by p (the more
divisible, the shorter). By definition, the p-adic norm of an integer $m \in \mathbb{Z}$ is
$|m|_p = p^{-k}$, where $k \in \mathbb{N} \cup \{0\}$ is the degree of divisibility of m by the
prime p (i.e. $m = p^k m'$, $p \nmid m'$), and $|0|_p = 0$. This norm is a mapping from
$\mathbb{Z}$ into the non-negative rational numbers and has the following properties:

(i) $|x|_p \ge 0$, $|x|_p = 0$ if and only if $x = 0$,

(ii) $|x y|_p = |x|_p |y|_p$,

(iii) $|x + y|_p \le \max\{|x|_p, |y|_p\} \le |x|_p + |y|_p$ for all $x, y \in \mathbb{Z}$.

Because of the strong triangle inequality $|x + y|_p \le \max\{|x|_p, |y|_p\}$, the p-adic
absolute value is a non-Archimedean (ultrametric) norm. One can easily conclude that
$0 \le |m|_p \le 1$ for any $m \in \mathbb{Z}$ and any prime p.
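
The definition of the p-adic norm translates directly into code. The sketch below is a minimal Python version using exact rational arithmetic; the function name `padic_norm` is an assumption of this example, and the loop only spot-checks properties (ii) and (iii) on a small range of integers rather than proving them.

```python
from fractions import Fraction
from itertools import product


def padic_norm(m: int, p: int) -> Fraction:
    """Return |m|_p = p**(-k), where p**k is the largest power of p dividing m."""
    if m == 0:
        return Fraction(0)
    m, k = abs(m), 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)


# Spot-check properties (ii) and (iii) for p = 5 on a small range of integers.
for x, y in product(range(-30, 31), repeat=2):
    nx, ny = padic_norm(x, 5), padic_norm(y, 5)
    assert padic_norm(x * y, 5) == nx * ny
    assert padic_norm(x + y, 5) <= max(nx, ny) <= nx + ny

print(padic_norm(50, 5), padic_norm(7, 5), padic_norm(0, 5))  # 1/25 1 0
```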

The p-adic distance between two integers x and y is

$d_p(x, y) = |x - y|_p$ .   (2)

Since the p-adic absolute value is ultrametric, the p-adic distance (2) is also ultrametric,
i.e. it satisfies the inequality

$d_p(x, y) \le \max\{d_p(x, z), d_p(z, y)\} \le d_p(x, z) + d_p(z, y)$ ,   (3)

where x, y and z are any three integers.

The 5-adic distance between two numbers $a, b \in C_5[64]$ is

$d_5(a, b) = |a_0 + a_1 5 + a_2 5^2 - b_0 - b_1 5 - b_2 5^2|_5$ ,   (4)

where $a_i, b_i \in \{1, 2, 3, 4\}$. When $a \ne b$, then $d_5(a, b)$ may have three different values:

• $d_5(a, b) = 1$ if $a_0 \ne b_0$,

• $d_5(a, b) = 1/5$ if $a_0 = b_0$ and $a_1 \ne b_1$,

• $d_5(a, b) = 1/5^2$ if $a_0 = b_0$, $a_1 = b_1$ and $a_2 \ne b_2$.

We see that the largest 5-adic distance between numbers is 1, and it is the maximum p-adic
distance on $\mathbb{Z}$. The smallest 5-adic distance on the space $C_5[64]$ is $5^{-2}$. Let us
also note that the 5-adic distance depends only on the first two digits of two different
numbers $a, b \in C_5[64]$.
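
The three possible values of the 5-adic distance on $C_5[64]$, and the strong triangle inequality (3), can be confirmed by brute force. The following Python sketch is a check under the definitions above, not part of the original derivation; `padic_norm`, `elements` and `d5` are names introduced here for illustration.

```python
from fractions import Fraction
from itertools import product


def padic_norm(m: int, p: int) -> Fraction:
    """Return |m|_p = p**(-k), where p**k is the largest power of p dividing m."""
    if m == 0:
        return Fraction(0)
    m, k = abs(m), 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)


# Elements of C_5[64] as integers n0 + n1*5 + n2*25 with digits in {1, 2, 3, 4}.
elements = [n0 + 5 * n1 + 25 * n2 for n0, n1, n2 in product((1, 2, 3, 4), repeat=3)]

# Precompute all pairwise 5-adic distances.
d5 = {(a, b): padic_norm(a - b, 5) for a in elements for b in elements}

values = sorted({d for (a, b), d in d5.items() if a != b})
print([str(v) for v in values])  # ['1/25', '1/5', '1']

# Strong triangle inequality (3), checked over every triple of elements.
ultrametric = all(
    d5[x, y] <= max(d5[x, z], d5[z, y])
    for x in elements for y in elements for z in elements
)
print(ultrametric)  # True
```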

If we use the real (standard) distance $d_\infty(a, b) = |a_0 + a_1 5 + a_2 5^2 - b_0 - b_1 5 - b_2 5^2|_\infty$,
then the third digits $a_2$ and $b_2$ would play a more important role than those at the second
position (i.e. $a_1$ and $b_1$), and the digits $a_0$ and $b_0$ are of the smallest importance. In
the real $C_5[64]$ space distances are also discrete, but take the values 1, 2, ..., 93. The
smallest real and the largest 5-adic distance are both equal to 1. While the real distance
describes the metric of ordinary physical space, this p-adic one may serve to describe
ultrametricity of the information space.
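
For comparison, the range of real distances quoted above can be verified in a few lines of Python. The names `elements` and `real_distances` are assumptions of this sketch.

```python
from itertools import product

# Elements of C_5[64] as ordinary integers.
elements = [n0 + 5 * n1 + 25 * n2 for n0, n1, n2 in product((1, 2, 3, 4), repeat=3)]

# All real distances |a - b| between different elements.
real_distances = {abs(a - b) for a in elements for b in elements if a != b}

print(min(real_distances), max(real_distances))  # 1 93
print(real_distances == set(range(1, 94)))       # True
```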

The ultrametric space $C_5[64]$ can be viewed as 16 quadruplets with respect to the smallest
5-adic distance, i.e. quadruplets contain 4 elements and the 5-adic distance between any two
elements within a quadruplet is $\frac{1}{25}$. In other words, within each quadruplet the
elements have the first two digits equal and the third digits different.

With respect to the 2-adic distance, the above quadruplets may be viewed as composed
of two doublets: $a = a_0 a_1 1$ and $b = a_0 a_1 3$ make the first doublet, and $c = a_0 a_1 2$ and
$d = a_0 a_1 4$ form the second one. The 2-adic distance between codons within each of these
doublets is $\frac{1}{2}$, i.e.

$d_2(a, b) = |(3 - 1) 5^2|_2 = \frac{1}{2}$ ,   $d_2(c, d) = |(4 - 2) 5^2|_2 = \frac{1}{2}$ ,   (5)

because 3 - 1 = 4 - 2 = 2. In this way the ultrametric space $C_5[64]$ of 64 elements is
arranged into 32 doublets.
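
The splitting of each quadruplet into two doublets can be reproduced by collecting, inside every quadruplet, the pairs at 2-adic distance 1/2. This Python sketch uses hypothetical helper names (`padic_norm`, `value`, `doublets`) and simply applies definition (2) with p = 2.

```python
from fractions import Fraction
from itertools import combinations, product

DIGITS = (1, 2, 3, 4)


def padic_norm(m: int, p: int) -> Fraction:
    """Return |m|_p = p**(-k), where p**k is the largest power of p dividing m."""
    if m == 0:
        return Fraction(0)
    m, k = abs(m), 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)


def value(digits: tuple) -> int:
    """Convert digits (n0, n1, n2) to the integer n0 + n1*5 + n2*25."""
    n0, n1, n2 = digits
    return n0 + 5 * n1 + 25 * n2


doublets = []
for n0, n1 in product(DIGITS, repeat=2):
    quadruplet = [(n0, n1, n2) for n2 in DIGITS]
    for a, b in combinations(quadruplet, 2):
        # Keep pairs inside a quadruplet whose 2-adic distance equals 1/2.
        if padic_norm(value(a) - value(b), 2) == Fraction(1, 2):
            doublets.append((a, b))

print(len(doublets))                           # 32
print(sorted({(a[2], b[2]) for a, b in doublets}))  # [(1, 3), (2, 4)]
```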

5 5-Adic Codon Space 

It is not difficult to note that the ultrametric space of numbers in $C_5[64]$ and the
distribution of codons in Table 2 of the vertebrate mitochondrial code have some similarity.
Already at first glance, one can see that both have 64 elements and that there are
quadruplets with equal first two entries in the triples of letters and in the triples of
digits.

Identifying nucleotides appropriately with digits, we obtain the corresponding ultrametric
structure of the codon space in the vertebrate mitochondrial genetic code. We take
the following assignments between nucleotides and digits in $C_5[64]$: C (cytosine) = 1, A
(adenine) = 2, T (thymine) = U (uracil) = 3, G (guanine) = 4. Ordering the 5-adic numbers
in increasing order, one obtains a rearranged codon space, which is presented in Table 3.
There is now an evident one-to-one correspondence between codons in three-letter notation
and the three-digit $n_0 n_1 n_2$ number representation of the ultrametric space $C_5[64]$.

The above introduced set $C_5[64]$ endowed with p-adic distance we shall call the p-adic codon
space, i.e. elements of $C_5[64]$ are also codons denoted by $n_0 n_1 n_2$.
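
The nucleotide-to-digit assignment can be written as a small lookup table. In the Python sketch below, `DIGIT`, `codon_digits` and `codon_number` are illustrative names; only the assignment C = 1, A = 2, T = U = 3, G = 4 comes from the text.

```python
# Assignment from the text: C = 1, A = 2, T = U = 3, G = 4.
DIGIT = {"C": 1, "A": 2, "U": 3, "T": 3, "G": 4}


def codon_digits(codon: str) -> str:
    """Return the three-digit label n0n1n2 of a codon, e.g. 'UGA' -> '342'."""
    return "".join(str(DIGIT[base]) for base in codon.upper())


def codon_number(codon: str) -> int:
    """Return the natural number n0 + n1*5 + n2*25 assigned to a codon."""
    n0, n1, n2 = (DIGIT[base] for base in codon.upper())
    return n0 + 5 * n1 + 25 * n2


print(codon_digits("UGA"), codon_number("UGA"))  # 342 73
print(codon_digits("CCC"), codon_number("CCC"))  # 111 31
```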

Let us now explore distances between codons and their role in the formation of the genetic
code degeneracy.

To this end let us again turn to Table 3 as a representation of the $C_5[64]$ codon space.
Namely, we observe that there are 16 quadruplets such that each of them has the same first
two digits. Hence the 5-adic distance between any two different codons within a quadruplet is

$d_5(a, b) = |a_0 + a_1 5 + a_2 5^2 - a_0 - a_1 5 - b_2 5^2|_5$

$= |(a_2 - b_2) 5^2|_5 = |a_2 - b_2|_5 \, |5^2|_5 = 5^{-2}$ ,   (6)

because $a_0 = b_0$, $a_1 = b_1$ and $|a_2 - b_2|_5 = 1$. According to (6), codons within every
quadruplet are at the smallest distance, i.e. they are the closest compared to all other
codons.

Since codons are composed of three ordered nucleotides, each of which is either a purine
or a pyrimidine, it is natural to try to quantify the similarity inside purines and
pyrimidines, as well as the distinction between elements from these two groups of
nucleotides. Fortunately there is a tool, which is again related to the p-adics, and now it
is the 2-adic distance. One can easily see that the 2-adic distance between the pyrimidines
C and U is $d_2(1, 3) = |3 - 1|_2 = 1/2$, the same as the distance between the purines A and
G, namely $d_2(2, 4) = |4 - 2|_2 = 1/2$. However, the 2-adic distance between C and A or G,
as well as the distance between U and A or G, is 1 (i.e. the maximum).

One can now look at Table 3 as a system of 32 doublets. Thus the 64 codons are clustered
in a very regular way into 32 doublets. Each of the 21 items (20 amino acids and 1
termination signal) is coded by one, two or three doublets. In fact, there are two, six
and twelve amino acids coded by three, two and one doublet, respectively. The remaining two
doublets code the termination signal.

Note that 2 doublets code 2 amino acids (Ser and Leu) which are already coded by 2 
quadruplets, thus amino acids Serine and Leucine are coded by 6 codons (3 doublets). 
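
The doublet counts above can be checked against Table 2. In the following Python sketch a doublet is a pair of codons with the same first two bases and a third base that is either a pyrimidine pair (C, U) or a purine pair (A, G); the one-letter string `AMINO` is our own transcription of Table 2 (`*` for Ter), and all names are illustrative assumptions.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acids for UUU, UUC, UUA, UUG, UCU, ... in the order of Table 2.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

doublet_meaning = {}
for b1, b2 in product(BASES, repeat=2):
    for pair in ("CU", "AG"):  # pyrimidine-ending and purine-ending doublets
        meanings = {CODE[b1 + b2 + b3] for b3 in pair}
        assert len(meanings) == 1  # both codons of a doublet have one meaning
        doublet_meaning[b1 + b2 + pair] = meanings.pop()

per_meaning = Counter(doublet_meaning.values())
# How many amino acids are coded by 1, 2 and 3 doublets (stop signal excluded).
by_doublets = Counter(n for aa, n in per_meaning.items() if aa != "*")

print(len(doublet_meaning), per_meaning["*"])  # 32 2
print(sorted(by_doublets.items()))             # [(1, 12), (2, 6), (3, 2)]
```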

To have a more complete picture of the genetic code it is useful to consider possible
distances between codons of different quadruplets as well as between different doublets.
Also, we introduce the distance between quadruplets or between doublets, especially when
the distances between their codons have the same value. Thus the 5-adic distance between any
two quadruplets in the same column is 1/5, while such distance between other quadruplets
is 1. The 5-adic distance between doublets coincides with the distance between quadruplets,
and this distance is $\frac{1}{25}$ when the doublets are within the same quadruplet.

The 2-adic distances between codons, doublets and quadruplets are more complex.
There are three basic cases:

• codons differ in only one digit,

• codons differ in two digits,

• codons differ in all three digits.

In the first case, the 2-adic distance can be $\frac{1}{2}$ or 1, depending on whether the
difference between the digits is 2 or not, respectively.

Let us now look at the 2-adic distances between doublets coding leucine and also between
doublets coding serine. These are the two cases of amino acids coded by three doublets. One
has the following distances:

• $d_2(332, 132) = d_2(334, 134) = \frac{1}{2}$ for leucine,

• $d_2(311, 241) = d_2(313, 243) = \frac{1}{2}$ for serine.
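
These four distances follow from definition (2) applied to the digit strings. The Python sketch below recomputes them and, for comparison, prints the corresponding 5-adic distances; the helper names `padic_norm`, `number` and `dist` are assumptions of this example.

```python
from fractions import Fraction


def padic_norm(m: int, p: int) -> Fraction:
    """Return |m|_p = p**(-k), where p**k is the largest power of p dividing m."""
    if m == 0:
        return Fraction(0)
    m, k = abs(m), 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)


def number(digits: str) -> int:
    """Convert the digit string n0n1n2 to n0 + n1*5 + n2*25."""
    n0, n1, n2 = (int(ch) for ch in digits)
    return n0 + 5 * n1 + 25 * n2


def dist(a: str, b: str, p: int) -> Fraction:
    """p-adic distance between two codons given by their digit strings."""
    return padic_norm(number(a) - number(b), p)


pairs = [("332", "132"), ("334", "134"), ("311", "241"), ("313", "243")]
for a, b in pairs:
    # Columns: codon a, codon b, 2-adic distance, 5-adic distance.
    print(a, b, dist(a, b, 2), dist(a, b, 5))  # each line ends with: 1/2 1
```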

If we used the usual distance between codons instead of the p-adic one, then we would
observe that two synonymous codons of the same quadruplet are far apart (at least 25
units), and that those which are closest code different amino acids. Thus we conclude that
not the usual metric but an ultrametric is inherent to the codon space.
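
The contrast with the usual distance can be made concrete. The Python sketch below, with names and the one-letter table `AMINO` introduced only for illustration, computes the smallest real distance between codons of the same quadruplet and checks what the pairs at real distance 1 encode in the vertebrate mitochondrial code.

```python
from itertools import combinations, product

BASES = "UCAG"
# One-letter amino acids for UUU, UUC, UUA, UUG, UCU, ... in the order of Table 2.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}


def number(codon: str) -> int:
    """Natural number n0 + n1*5 + n2*25 assigned to a codon."""
    n0, n1, n2 = (DIGIT[base] for base in codon)
    return n0 + 5 * n1 + 25 * n2


# Real distances between different codons sharing the first two bases.
same_quadruplet = [
    abs(number(a) - number(b)) for a, b in combinations(CODE, 2) if a[:2] == b[:2]
]
# Pairs of codons at the smallest possible real distance.
nearest = [(a, b) for a, b in combinations(CODE, 2) if abs(number(a) - number(b)) == 1]

print(min(same_quadruplet))                          # 25
print(len(nearest))                                  # 48
print(all(CODE[a] != CODE[b] for a, b in nearest))   # True
```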

How is the degeneracy of the genetic code connected with p-adic distances between codons?
The answer is in the following basic p-adic degeneracy principle: Amino acids are coded
by doublets of codons, where a doublet contains two codons at the smallest ($\frac{1}{25}$)
5-adic distance and at $\frac{1}{2}$ 2-adic distance. Here p-adic distance plays the role of
similarity: the closer, the more similar. Taking into account the standard genetic code,
there is a slight violation of this principle. It is worth noting that in modern particle
physics it is just the breaking of the fundamental gauge symmetry that gives its standard
model. It makes sense to introduce a new principle (let us call it the reality principle):
Reality is a realization of some broken fundamental principles. It seems that this
principle is valid not only in physics but also in all sciences. In this context the modern
genetic code, especially the standard genetic code, is an evolutionary breaking of the above
p-adic degeneracy principle.

Let us now turn to Table 2. We observe that this table can be regarded as a big
rectangle divided into 16 equal smaller rectangles: 8 of them are quadruplets which
correspond one-to-one to 8 amino acids, and the other 8 rectangles are divided into 16
doublets coding 14 amino acids and the termination (stop) signal (by two doublets at
different places). However, there is no manifest symmetry in the distribution of these
quadruplets and doublets.

In order to get a symmetry we have rewritten this standard table into a new one by
rearranging the 16 rectangles. As a result we obtained Table 3, which exhibits a symmetry
with respect to the distribution of codon quadruplets and codon doublets. Namely, in
our table quadruplets and doublets separately form two figures, which are symmetric with
respect to the vertical midline (a left-right symmetry), i.e. they are invariant under
the interchange $C \leftrightarrow G$ and $A \leftrightarrow U$ at the first position in codons
in all horizontal lines. The observed left-right symmetry is now invariance under the
corresponding transformations $1 \leftrightarrow 4$ and $2 \leftrightarrow 3$. In other words, in
each horizontal line one can perform the doublet $\leftrightarrow$ doublet and quadruplet
$\leftrightarrow$ quadruplet interchange around the vertical midline. Recall that DNA is also
symmetric under the simultaneous interchange of complementary nucleotides
$C \leftrightarrow G$ and $A \leftrightarrow T$ between its strands. All doublets in this table
form a nice figure which looks like the letter T.

Table 3. The p-adic vertebrate mitochondrial genetic code.

111 CCC Pro    211 ACC Thr    311 UCC Ser    411 GCC Ala
112 CCA Pro    212 ACA Thr    312 UCA Ser    412 GCA Ala
113 CCU Pro    213 ACU Thr    313 UCU Ser    413 GCU Ala
114 CCG Pro    214 ACG Thr    314 UCG Ser    414 GCG Ala

121 CAC His    221 AAC Asn    321 UAC Tyr    421 GAC Asp
122 CAA Gln    222 AAA Lys    322 UAA Ter    422 GAA Glu
123 CAU His    223 AAU Asn    323 UAU Tyr    423 GAU Asp
124 CAG Gln    224 AAG Lys    324 UAG Ter    424 GAG Glu

131 CUC Leu    231 AUC Ile    331 UUC Phe    431 GUC Val
132 CUA Leu    232 AUA Met    332 UUA Leu    432 GUA Val
133 CUU Leu    233 AUU Ile    333 UUU Phe    433 GUU Val
134 CUG Leu    234 AUG Met    334 UUG Leu    434 GUG Val

141 CGC Arg    241 AGC Ser    341 UGC Cys    441 GGC Gly
142 CGA Arg    242 AGA Ter    342 UGA Trp    442 GGA Gly
143 CGU Arg    243 AGU Ser    343 UGU Cys    443 GGU Gly
144 CGG Arg    244 AGG Ter    344 UGG Trp    444 GGG Gly

Table 4. 20 standard amino acids with assigned corresponding numbers.

11 Proline      21 Threonine     31 Serine           41 Alanine
12 Histidine    22 Asparagine    32 Tyrosine         42 Aspartate
13 Leucine      23 Isoleucine    33 Phenylalanine    43 Valine
14 Arginine     24 Lysine        34 Cysteine         44 Glycine
1 Glutamine     2 Methionine     3 Tryptophan        4 Glutamate

Note that the above invariance also leaves unchanged the polarity and hydrophobicity of
the corresponding amino acids in all but three cases: Asn $\leftrightarrow$ Tyr,
Arg $\leftrightarrow$ Gly, and Ser $\leftrightarrow$ Cys.
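
The left-right symmetry of Table 3 can be tested mechanically: a block of four codons with the same first two bases is a quadruplet if all four code the same amino acid, and the symmetry says this property is unchanged when the first base is swapped by $C \leftrightarrow G$, $A \leftrightarrow U$. The Python sketch below uses our own one-letter transcription `AMINO` of Table 2 and hypothetical names `SWAP` and `is_quadruplet`.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for UUU, UUC, UUA, UUG, UCU, ... in the order of Table 2.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Interchange of the first nucleotide: C <-> G and A <-> U.
SWAP = {"C": "G", "G": "C", "A": "U", "U": "A"}


def is_quadruplet(b1: str, b2: str) -> bool:
    """True if all four codons b1 b2 N code the same amino acid (or stop)."""
    return len({CODE[b1 + b2 + b3] for b3 in BASES}) == 1


blocks = list(product(BASES, repeat=2))
symmetric = all(is_quadruplet(b1, b2) == is_quadruplet(SWAP[b1], b2) for b1, b2 in blocks)
n_quadruplets = sum(is_quadruplet(b1, b2) for b1, b2 in blocks)

print(symmetric, n_quadruplets)  # True 8
```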

It is also worth noting that the four nucleotides are related to the prime number 5 by their
correspondence to the four nonzero digits (1, 2, 3, 4) of p = 5. It is inappropriate to use
the digit 0 for a nucleotide because it leads to non-uniqueness in the representation of
codons by natural numbers. For example, 123 = 123000 as numbers, but 123 would represent one
and 123000 two codons. This is also a reason why we do not use a 4-adic representation for
codons, since it would contain a nucleotide represented by the digit 0. One can use 0 as a
digit to denote the absence of any nucleotide.
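
The non-uniqueness caused by a digit 0 is easy to see numerically. In this Python sketch the function `value` (a name chosen here) evaluates a digit string with the lowest digit first, as in the notation $n_0 n_1 n_2$ of the text.

```python
def value(digits: str, base: int = 5) -> int:
    """Evaluate a digit string n0 n1 n2 ... as n0 + n1*base + n2*base**2 + ..."""
    return sum(int(d) * base ** i for i, d in enumerate(digits))


# Trailing zeros are higher-order zero digits, so they do not change the number.
print(value("123"), value("123000"))  # 86 86
```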

6 5-Adic Amino Acids Space 

In Table 4 we assigned numbers $x_0 x_1 \equiv x_0 + x_1 \cdot 5$ to the 16 amino acids which 
are assumed to be present in the dinucleotide coding epoch, and $x_0 = 1, 2, 3, 4$ is attached 
to the four late amino acids which were added during trinucleotide coding. Having these 
5-adic numbers for amino acids, one can consider the distance between codons and amino 
acids: there are 23 codon doublets which are at 5-adic distance $\frac{1}{25}$ from the 
corresponding 15 amino acids, i.e. the codons within these doublets and the related amino 
acids are at the same 5-adic distance. 
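
The distance between a trinucleotide codon and its amino acid number can be checked by 
hand. The following worked example assumes the digit order $x_0 x_1 x_2$ (least significant 
digit first) used in the text and the standard 5-adic norm, $|5^k m|_5 = 5^{-k}$ for $m$ not 
divisible by 5; the choice of CCC (111) and proline (11) is only an illustration. 

```latex
\begin{align*}
111 &= 1 + 1\cdot 5 + 1\cdot 5^2 = 31, \qquad 11 = 1 + 1\cdot 5 = 6,\\
d_5(111, 11) &= |31 - 6|_5 = |5^2|_5 = \frac{1}{25}.
\end{align*}
```

The same value arises for any codon $x_0 x_1 x_2$ and amino acid number $x_0 x_1$ with the 
same first two digits, since their difference is $x_2 \cdot 5^2$ with $x_2 \in \{1, 2, 3, 4\}$. 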

7 Possible Evolution of the Genetic Code 

There are two types of evolution of the genetic code: 1) evolution of the codon space and 
2) evolution of the amino acid sector with a fixed trinucleotide codon space. We shall discuss 
mainly the first type of evolution. 

The origin and early evolution of the genetic code are among the most interesting and 
important subjects of investigation related to the origin and evolution of life. However, 
since there are no concrete fossils from that early period, the subject gives rise to many 
speculations. Nevertheless, one can hope that some of the hypotheses may be tested by 
looking for the corresponding traces in contemporary genomes. 

It seems natural to consider biological evolution as an adaptive development of simpler 
living systems into more complex ones. Namely, living organisms are open systems 
in permanent interaction with the environment. Thus evolution can be modelled by a 
system with given initial conditions, guided by some internal rules that take environmental 
factors into account. 

We now conjecture on the evolution of the genetic code using our p-adic approach to 
the codon space, assuming that preceding codes used simpler codons and older amino 
acids. 

Consider a general p-adic codon space $C_p[(p-1)^m]$, which has two parameters: p, related 
to the $p - 1$ building blocks, and m, the multiplicity of the building blocks in codons. Then 

• Case $C_2[1]$ is a trivial one and useless for a primitive code. 

• Case $C_3[2^m]$ with m = 1, 2, 3 does not seem to be realistic. 

• Case $C_5[4^m]$ with m = 1, 2, 3 offers a possible pattern to consider evolution of the 
genetic code. Namely, the codon space could evolve in the following way: 
$C_5[4] \to C_5[4^2] \to C_5[4^3] = C_5[64]$. 

According to Table 5, the primary code, containing codons in the single nucleotide 
form (C, A, U, G), encoded the first four amino acids (see Table 6): Gly, Ala, Asp and 
Val. From the last column of Table 4 we conclude that the connection between digits and 
amino acids is: 1 → Ala, 2 → Asp, 3 → Val, 4 → Gly. In the primary code these digits 
occupied the first position in the 5-adic expansion (Table 5), and at the next step, i.e. 
$C_5[4] \to C_5[4^2]$, they moved to the second position, with digits 1, 2, 3, 4 added in front 
of each of them. 

In $C_5[4^2]$ one has 16 dinucleotide codons which can code up to 16 new amino acids. 
Addition of the digit 4 in front of the already existing codons 1, 2, 3, 4 leaves their meaning 
unchanged, i.e. 41 → Ala, 42 → Asp, 43 → Val, 44 → Gly. Adding digits 3, 2, 1 in front 
of the primary 1, 2, 3, 4 codons, one obtains 12 possibilities for coding some new amino 
acids. To decide which amino acid was encoded by which of the 12 dinucleotide codons, we 
use as a criterion their immutability in the trinucleotide coding on the $C_5[4^3]$ space. This 
criterion assumes that amino acids encoded earlier are more fixed than those encoded 
later. According to this criterion we decide in favor of the first row in each rectangle of 
Table 3, and the result is presented in Table 7. 
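
The selection rule can be sketched in a few lines. The sketch reads "the first row in each 
rectangle of Table 3" as the codon whose third digit is 1 (third nucleotide C), and it 
assumes that Table 3 is the vertebrate mitochondrial code, written here as a 64-letter 
string in the conventional U, C, A, G order. All names in the code are hypothetical. 

```python
BASES = "UCAG"  # conventional ordering of each codon position

# One-letter amino acids for the codons UUU, UUC, UUA, UUG, UCU, ...
# '*' marks a stop (Ter) codon. Assumed: vertebrate mitochondrial code.
VERT_MITO = ("FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR"
             "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG")

THREE_LETTER = {
    "F": "Phe", "L": "Leu", "S": "Ser", "Y": "Tyr", "C": "Cys", "W": "Trp",
    "P": "Pro", "H": "His", "Q": "Gln", "R": "Arg", "I": "Ile", "M": "Met",
    "T": "Thr", "N": "Asn", "K": "Lys", "V": "Val", "A": "Ala", "D": "Asp",
    "E": "Glu", "G": "Gly", "*": "Ter",
}

# Digit assignment of the paper: C = 1, A = 2, U = 3, G = 4.
DIGIT = {"C": "1", "A": "2", "U": "3", "G": "4"}

# Map every trinucleotide codon to its one-letter amino acid.
CODE = {}
for i, a in enumerate(BASES):
    for j, b in enumerate(BASES):
        for k, c in enumerate(BASES):
            CODE[a + b + c] = VERT_MITO[16 * i + 4 * j + k]


def dinucleotide_code():
    """For each doublet XY, take the amino acid of the codon XYC (digits xy1)."""
    table = {}
    for second in "CAUG":
        for first in "CAUG":
            amino = THREE_LETTER[CODE[first + second + "C"]]
            table[DIGIT[first] + DIGIT[second]] = (first + second, amino)
    return table


for number, (doublet, amino) in dinucleotide_code().items():
    print(number, doublet, amino)
```

Under these assumptions the output lists 16 doublets coding 15 distinct amino acids, with 
serine appearing twice and no stop signal. 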

Transition from dinucleotide to trinucleotide codons occurred by attaching nucleotides 
1, 2, 3, 4 at the third position, i.e. behind each dinucleotide. In this way one obtains 

Table 5. 5-Adic system including digit 0, and containing single nucleotide, dinucleotide 
and trinucleotide codons. 

000       100 C     200 A     300 U     400 G 
010       110 CC    210 AC    310 UC    410 GC 
020       120 CA    220 AA    320 UA    420 GA 
030       130 CU    230 AU    330 UU    430 GU 
040       140 CG    240 AG    340 UG    440 GG 

001       101       201       301       401 
011       111 CCC   211 ACC   311 UCC   411 GCC 
021       121 CAC   221 AAC   321 UAC   421 GAC 
031       131 CUC   231 AUC   331 UUC   431 GUC 
041       141 CGC   241 AGC   341 UGC   441 GGC 

002       102       202       302       402 
012       112 CCA   212 ACA   312 UCA   412 GCA 
022       122 CAA   222 AAA   322 UAA   422 GAA 
032       132 CUA   232 AUA   332 UUA   432 GUA 
042       142 CGA   242 AGA   342 UGA   442 GGA 

003       103       203       303       403 
013       113 CCU   213 ACU   313 UCU   413 GCU 
023       123 CAU   223 AAU   323 UAU   423 GAU 
033       133 CUU   233 AUU   333 UUU   433 GUU 
043       143 CGU   243 AGU   343 UGU   443 GGU 

004       104       204       304       404 
014       114 CCG   214 ACG   314 UCG   414 GCG 
024       124 CAG   224 AAG   324 UAG   424 GAG 
034       134 CUG   234 AUG   334 UUG   434 GUG 
044       144 CGG   244 AGG   344 UGG   444 GGG 



Ignoring numbers which contain the digit 0 in front of any 1, 2, 3 or 4, one has a one-to-one 
correspondence between 1-digit, 2-digit and 3-digit numbers and single nucleotides, 
dinucleotides and trinucleotides, respectively. It seems that the evolution of codons has 
followed the transitions: single nucleotides → dinucleotides → trinucleotides. 
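
The counting behind this correspondence can be reproduced with a short sketch. It 
enumerates all 125 three-digit strings of Table 5 and keeps those in which no 0 stands in 
front of a nonzero digit; the helper names are hypothetical, and the digit assignment 
1 = C, 2 = A, 3 = U, 4 = G is the one used in this paper. 

```python
from itertools import product

NUCLEOTIDE = {"1": "C", "2": "A", "3": "U", "4": "G"}


def is_codon(digits: str) -> bool:
    """True if the string is not all zeros and no 0 precedes a nonzero digit."""
    core = digits.rstrip("0")  # trailing zeros mean "no nucleotide"
    return core != "" and "0" not in core


def table5_codons():
    """Group the admissible entries of Table 5 by codon length (1, 2 or 3)."""
    by_length = {1: [], 2: [], 3: []}
    for triple in product("01234", repeat=3):
        digits = "".join(triple)
        if is_codon(digits):
            core = digits.rstrip("0")
            by_length[len(core)].append("".join(NUCLEOTIDE[d] for d in core))
    return by_length


counts = {length: len(codons) for length, codons in table5_codons().items()}
print(counts)  # {1: 4, 2: 16, 3: 64}
```

The three counts are the sizes 4, 4^2 and 4^3 of the successive codon spaces. 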

Table 6. Temporal appearance of the 20 standard amino acids (Trifonov, 2004). 

(1) Gly     (2) Ala     (3) Asp     (4) Val 
(5) Pro     (6) Ser     (7) Glu     (8) Leu 
(9) Thr     (10) Arg    (11) Ile    (12) Gln 
(13) Asn    (14) His    (15) Lys    (16) Cys 
(17) Phe    (18) Tyr    (19) Met    (20) Trp 



Table 7. The dinucleotide genetic code based on the p-adic codon space $C_5[4^2]$. 

11 CC Pro    21 AC Thr    31 UC Ser    41 GC Ala 
12 CA His    22 AA Asn    32 UA Tyr    42 GA Asp 
13 CU Leu    23 AU Ile    33 UU Phe    43 GU Val 
14 CG Arg    24 AG Ser    34 UG Cys    44 GG Gly 

Note that this code encodes 15 amino acids without a stop codon, but encodes serine twice. 

a new codon space $C_5[4^3] = C_5[64]$, which is significantly enlarged and provides a pattern 
to generate known genetic codes. This codon space $C_5[64]$ makes it possible to realize at 
least three general properties of the modern code: 

(i) encoding of more than 16 amino acids, 

(ii) diversity of codes, 

(iii) stability of the gene expression. 

Let us give some relevant clarifications. 

(i) For the functioning of contemporary living organisms it is necessary to code at least 
20 standard (Table 1) and 2 non-standard amino acids (selenocysteine and pyrrolysine). 
Probably these 22 amino acids are also sufficient building units for biosynthesis of all 
necessary contemporary proteins. While $C_5[4^2]$ is insufficient, the codon space $C_5[4^3]$ 
offers approximately three codons per amino acid. 

(ii) The standard code was deciphered around 1966 and was thought to be universal, 
i.e., common to all organisms. When the human mitochondrial code was discovered in 
1979, it gave rise to the belief that the code is not frozen and that there are also some other 
codes which are mutually different. According to later evidence, one can say that there 
are about 20 slightly different mitochondrial and nuclear codes (for a review, see (Knight 
et al., 2001; Osawa et al., 1992) and references therein). Different codes have some codons 
with different meaning. Thus, the standard genetic code differs from Table 3 by the 
following changes: 

• 232 (AUA): Met → Ile, 

• 242 (AGA) and 244 (AGG): Ter → Arg, 

• 342 (UGA): Trp → Ter. 
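
These reassignments can be cross-checked by comparing two code tables codon by codon. 
In the sketch below both 64-letter strings are assumptions of this illustration: they are 
taken to be the standard and vertebrate mitochondrial codes in the conventional U, C, A, G 
codon order, with '*' for a stop codon. The function names are hypothetical. 

```python
BASES = "UCAG"  # conventional ordering of each codon position

# Assumed code tables, one letter per codon in the order UUU, UUC, UUA, UUG, ...
STANDARD = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
            "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
VERT_MITO = ("FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR"
             "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG")

# Digit assignment of the paper: C = 1, A = 2, U = 3, G = 4.
DIGIT = {"C": "1", "A": "2", "U": "3", "G": "4"}


def all_codons():
    """The 64 codons in the same order as the code strings."""
    return [a + b + c for a in BASES for b in BASES for c in BASES]


def reassignments(old: str, new: str) -> dict:
    """Return {5-adic number: (codon, old meaning, new meaning)} where the codes differ."""
    changed = {}
    for codon, before, after in zip(all_codons(), old, new):
        if before != after:
            number = "".join(DIGIT[n] for n in codon)
            changed[number] = (codon, before, after)
    return changed


changes = reassignments(VERT_MITO, STANDARD)
for number in sorted(changes):
    print(number, *changes[number])
```

With these tables exactly four codons change meaning: 232 (AUA), 242 (AGA), 244 (AGG) 
and 342 (UGA). 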

(iii) Each of the 20 codes is degenerate and degeneration provides their stability against 
possible mutations. In other words, degeneration helps to minimize codon errors. 

Genetic codes based on single nucleotide and dinucleotide codons were mainly directed 
to coding amino acids with rather different properties. This may be the reason why the 
amino acids Glu and Gln are not coded in the dinucleotide code (Table 7): they are similar 
to Asp and Asn, respectively. However, to become almost optimal, trinucleotide codes 
have taken into account structural and functional similarities of amino acids. 

We presented here a hypothesis on the evolution of the genetic code, taking into account 
possible codon evolution, from 1-nucleotide to 3-nucleotide, and the temporal appearance 
of amino acids. This scenario may be extended to cell evolution, which probably should be 
considered as a coevolution of all its main ingredients (for an early idea of the coevolution, 
see (Wong, 1975)). 

8 Concluding Remarks 

There are two aspects of the genetic code related to: 

(i) multiplicity of codons which code the same amino acid, 

(ii) concrete assignment of codon multiplets to particular amino acids. 

The p-adic approach presented above gives a quite satisfactory description of aspect 
(i). The ultrametric behavior of p-adic distances between elements of the $C_5[64]$ codon 
space radically differs from that of the usual ones. Quadruplets and doublets of codons have 
a natural explanation within 5-adic and 2-adic nearness. Degeneracy of the genetic code in 
the form of doublets, quadruplets and sextuplets is a direct consequence of p-adic 
ultrametricity between codons. The p-adic $C_5[64]$ codon space is our theoretical pattern to 
consider all variants of the genetic code: some codes are direct representations of $C_5[64]$ 
and the others are its slight evolutionary modifications. 
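
The 5-adic and 2-adic nearness can be illustrated on one rectangle of Table 3. The sketch 
below assumes the digit assignment C = 1, A = 2, U = 3, G = 4, the digit order x0 x1 x2, 
and that the 2-adic distance is applied to the same natural numbers as the 5-adic one. 
The function names and the choice of the GA rectangle are only for illustration. 

```python
from fractions import Fraction
from itertools import combinations

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}


def codon_to_int(codon: str) -> int:
    """5-adic expansion x0 + x1*5 + x2*25 of a codon."""
    return sum(DIGIT[n] * 5**i for i, n in enumerate(codon))


def p_adic_distance(x: int, y: int, p: int) -> Fraction:
    """d_p(x, y) = p**(-k), where p**k is the highest power of p dividing x - y."""
    diff = abs(x - y)
    if diff == 0:
        return Fraction(0)
    k = 0
    while diff % p == 0:
        diff //= p
        k += 1
    return Fraction(1, p**k)


def strong_triangle_holds(numbers, p: int) -> bool:
    """Check d(x, z) <= max(d(x, y), d(y, z)) for all triples (ultrametricity)."""
    return all(
        p_adic_distance(x, z, p)
        <= max(p_adic_distance(x, y, p), p_adic_distance(y, z, p))
        for x in numbers for y in numbers for z in numbers
    )


# Rectangle 421-424 of Table 3: GAC Asp, GAA Glu, GAU Asp, GAG Glu.
rectangle = ["GAC", "GAA", "GAU", "GAG"]
for a, b in combinations(rectangle, 2):
    x, y = codon_to_int(a), codon_to_int(b)
    print(a, b, p_adic_distance(x, y, 5), p_adic_distance(x, y, 2))
```

All four codons are at 5-adic distance 1/25 from each other, while the 2-adic distance 
1/2 singles out the pairs GAC, GAU (both Asp) and GAA, GAG (both Glu). 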

(ii) Which amino acid corresponds to which doublet of codons? An answer to this 
question should be expected from connections between physicochemical properties of amino 
acids and anticodons. Namely, the enzyme aminoacyl-tRNA synthetase links a specific tRNA 
anticodon and the related amino acid. Thus there is no direct interaction between amino 
acids and trinucleotide codons, as was believed for some time in the past. However, from 
our p-adic analysis it follows that in an epoch of dinucleotide codons the connection between 
codons and amino acids should be direct. Namely, at that time the p-adic distance between 
dinucleotide codons and some amino acids was zero. 

Note that there are in general 4! ways to assign digits 1, 2, 3, 4 to nucleotides C, A, U, 
G. After an analysis of all 24 possibilities, we have taken C = 1, A = 2, U = T = 3, 
G = 4 as a quite appropriate choice. In addition to the various properties already presented 
in this paper, it also exhibits the complementarity of nucleotides in the DNA double helix 
through the relation C + G = A + T = 5. 

One can express many of the above considerations of p-adic information theory in linguistic 
terms and investigate possible linguistic aspects. 

In this paper we have employed p-adic distances to measure similarity between codons, 
which have been used to describe the degeneracy of the genetic code and to propose its 
evolution. It is worth noting that in other contexts p-adic distances can be interpreted 
with quite different meanings. For example, the 3-adic distance between cytosine and guanine 
is $d_3(1,4) = \frac{1}{3}$, and between adenine and thymine $d_3(2,3) = 1$. It seems natural 
to relate this 3-adic distance to the hydrogen bonds between complements in the DNA double 
helix: the smaller the distance, the stronger the hydrogen bond. Recall that C-G and A-T 
are bonded by 3 and 2 hydrogen bonds, respectively. 
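
The two 3-adic distances can be verified directly, assuming $d_3(x, y) = |x - y|_3$ with the 
standard 3-adic norm and the digit assignment C = 1, A = 2, T = 3, G = 4: 

```latex
\begin{align*}
d_3(1,4) &= |4 - 1|_3 = |3|_3 = \frac{1}{3} && \text{(C--G, 3 hydrogen bonds)},\\
d_3(2,3) &= |3 - 2|_3 = |1|_3 = 1 && \text{(A--T, 2 hydrogen bonds)}.
\end{align*}
```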

The translation of codon sequences into proteins is very much an information-processing 
phenomenon. The p-adic information modelling presented in this paper offers a new approach 
to the systematic investigation of ultrametric aspects of DNA and RNA sequences, the genetic 
code and the world of proteins. It can be embedded in computer programs to explore the 
p-adic side of the genome and related subjects. 

The above considerations and obtained results may be viewed as a contribution to the 
foundation of a p-adic theory of the genetic code, but also to the theory of p-adic information. 